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Abstract. Given a sample covariance matrix, we solve a maximum likelihood problem penalized 
by the number of nonzero coefficients in the inverse covariance matrix. Our objective is to find a 
sparse representation of the sample data and to highlight conditional independence relationships 
between the sample variables. We first formulate a convex relaxation of this combinatorial problem, 
we then detail two efficient first-order algorithms with low memory requirements to solve large-scale, 
dense problem instances. 

1. Introduction. We discuss a problem of model selection 1 . Given n variables 
drawn from a Gaussian distribution Af(0,C), where the true covariance matrix C 
is unknown, we estimate C from a sample covariance matrix £ by maximizing its 
log-likelihood. Following I?], setting a certain number of coefficients in the inverse 
covariance matrix to zero, a procedure known as covariance selection, improves 
the stability of this estimation procedure by reducing the number of parameters to 
estimate and highlights structure in the underlying model. 

Here, we focus on the problem of discovering this pattern of zeroes in the inverse 
covariance matrix. We seek to trade-off the log-likelihood of the solution with the 
number of zeroes in its inverse, and solve the following estimation problem: 

, . maximize log det X — (S, X) — p Card(A") 

^ ' ' subject to al n -< X -< (31 n 

in the variable X <E S n , where £ G S^ is the sample covariance matrix, Card(X) is 
the cardinality of X, i.e. the number of nonzero components in X , p > is a parameter 
controlling the trade-off between log-likelihood and cardinality, finally a, f3 > fix 
bounds on the eigenvalues of the solution. 

Zeroes in the inverse covariance matrix correspond to conditionally independent 
variables in the model and this approach can be used to simultaneously determine 
a robust estimate of the covariance matrix and, perhaps more importantly, discover 
structure in the underlying graphical model. In particular, we can view as a 

model selection problem using Aikake (AIC, see PO) or Bayes (BIC, see jHJ) information 
criterions. Both these problems can be written as in with p = 2/N for the AIC 
problem and p = 2log(N/2)/N for the BIC problem, where N is the sample size. 
This has applications in speech recognition (see 012) or gene networks analysis (see 
E] for example) . 

The Card(X) penalty term makes the estimation problem combinatorial 
(NP-Hard in fact), and our first objective here is to derive a convex relaxation to this 
problem which can be solved efficiently. We then derive two first-order algorithms 
geared towards memory efficiency and large-scale, dense problem instances. 

In Pj, Bilmes proposed a method for covariance selection based on choosing sta- 
tistical dependencies according to conditional mutual information computed using 
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training data. Other recent work involves identifying those Gaussian graphical mod- 
els that are best supported by the data and any available prior information on the 
covariance matrix. This approach is used by on gene expression data. Recently, 

[HJ I12| also considered penalized maximum likelihood estimation for covariance selec- 
tion. In contrast to our results here, ^2] work on the Cholesky decomposition of X 
using an iterative (heuristic) algorithm to minimize a nonconvex penalized likelihood 
problem, while 6 propose a set of large scale interior point algorithms to solve sparse 
problems, i.e. problems for which the conditional independence structure is already 
known. 

The paper is organized as follows, in Section |2 we detail our convex relaxation 
of problem and study the dual. In Section |3 we derive two efficient algorithms 
to solve it. Finally, in Section^we describe some numerical results. 

2. Problem setup. 

2.1. Convex relaxation. Given a sample covariance matrix E € S„, we can 
write the following convex relaxation to the estimation problem : 

(2 1) maximize log det X — (E, X) — pl T |X|l 

^ ' ' subject to al n -< X -< (31 n , 

with variable le S n , where 1 is the n- vector of ones, so that 1 T |A|1 = Y^iij=i 
The penalty term involving the sum of absolute values of the entries of X is a proxy for 
the number of its non-zero elements: the function 1 T |A|1 can be seen as the largest 
convex lower bound on Caxd(X) on the hypercube, an argument used by for 
rank minimization. It is also often used in regression techniques, such as the LASSO 
studied by , when sparsity of the solution is a concern. This relaxation is provably 
tight in certain cases (see JO])- I n our model, the bounds (a,/3) on the eigenvalues of 
X are fixed, and user-chosen. Although we allow a = 0, /3 = +oo, such bounds are 
useful in practice to control the condition number of the solution. 

For p — 0, and provided E >- 0, problems and i|2.1[l have a unique solution 
X* = S -1 , and the corresponding maximum-likelihood estimate is E. Due to noise in 
the data, in practice, the sample estimate E may not have a sparse inverse, even if the 
underlying graphical model exhibits conditional independence properties. By striking 
a trade-off between maximality of the likelihood and number of non-zero elements 
in the inverse covariance matrix, our approach is potentially useful at discovering 
structure, precisely conditional independence properties in the data. This means that 
we have to focus on the case where the matrix X is dense. At the same time, it 
serves as a regularization technique: when E is rank-deficient, there is no well-defined 
maximum-likelihood estimate, whereas the solution to problem Ij2.1|l is always unique 
and well-defined for p > 0, as seen below. 

2.2. Dual problem, robustness. We can rewrite the relaxation l|2.1[l as the 

following min-max problem: 

(2.2) max min log det X - (E + U, X) 

{X: ctI„^X^/3I„} {U: \Uij\<p} 

which gives a natural interpretation of problem (|2.1() as a worst-case robust maxi- 
mum likelihood problem with componentwise bounded, additive noise on the sample 
covariance matrix E. The corresponding Lagrangian is given by: 



L(X, U,P, Q) = log det A - Tr((E + U + Q — P)X) - aTr P + (3TrQ 
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and we get the following dual to (|2.1[1 : 



, , minimize - logdet(E + U + Q - P) - n + a TV P - (3 Tr Q 

subject to P, Q >: 0, |J7y|<p, i,j — \,...,n, 

in the variables U, P,Q € S n - 

When a = and /3 = +oo, the first-order optimality conditions impose X(E + 
J7) = In, hence we always have: 

1 

X >z a(n)I n with a(n) := ttt 



j|| + np 

zero duality gap also means Tr(EX) = n - pl T \X\l. Because X and E are both 
positive semidefinite, we get: 



n 



IXllo < IIXIIp < 1 T |X|1 < - 



which, together with Tr(SX) > A min (E)||X|| 2 , means ||X|| 2 < n/A min (S). Finally 
then, we must always have: 

X<(3(n)I n with 0(n) :=nmin( i,||E _1 || 2 

VP 

and < a(n) < X(X) < /3(n) < +oo at the optimum. Setting a — and (3 = +oo in 
problem l|2.1|l is then equivalent to setting a — a(n) and /3 = (3{n). Since the objective 
function of problem (|2.1() is strictly convex when < a(n) < X(X) < (3(n) < +oo, 
this shows that l|2.1|l always has a unique solution. 

3. Algorithms. In this section, we present two algorithms for solving problem 
1)2. one based on an optimal first-order method developed in JHj; the other based 
on a block-coordinate gradient method. 

Of course, problem (|2.1(l is convex and can readily be solved using interior point 
methods (see 0] for example). However, such second-order methods become quickly 
impractical for solving since the corresponding complexity to compute an e- 

suboptimal solution is 0(n 6 log(l/e)). Note however that we cannot expect to do 
better than 0(n 3 ), which is the cost of solving the non-penalized problem for dense 
covariance matrices E. 

3.1. Smooth optimization. The recently-developed first-order algorithms due 
to |18| trade-off a better dependence on problem size against a worst dependence 
on accuracy, usually 1/e instead of its logarithm and the method we describe next 
has a complexity of 0(n 4 5 /e). In addition, the memory space requirement of these 
first-order methods is much lower than that of interior-point methods, which involve 
forming a dense Hessian, and hence become quickly prohibitive with a problem having 
0(n 2 ) variables. 

Nesterov 's format. The algorithm in |18| supposes that the function to minimize 
conforms to a certain representation. This is the case for our problem here, so we first 
write H2.2|) in the saddle- function format described in |18| : 

min -logdetX + (E,X) + pl T \X\l = min max f(X) + {A(X),U) 

XgQi ^GQi UGQ2 

where we define f(X) = — logdetX + (E, X), A = pl n 2, and 

Qi := {XeS n : al„ r< X X /3I n } , Q 2 := {U G S n : ||[/||co < 1} . 



The adjoint of this problem, corresponding to the dual problem (|2.3[1 . is then written: 



(3.1) max <j>{U) where </>(U) := min -logdetX + (£ + U,X). 

When a function can be represented in this saddle function format, the method de- 
scribed in |18| combines two steps. Regularization: by adding a strongly convex 
penalty to the saddle function representation of /, the algorithm first computes a 
smooth e-approximation of / with Lipschitz continuous gradient. This can be seen as 
a generalized Moreau-Yosida regularization step (see ^1] for example). Optimal first 
order minimization: the algorithm then applies the optimal first-order scheme for 
functions with Lipschitz continuous gradients detailed in JH] to the regularized func- 
tion. Each iteration requires efficiently computing the regularized function value and 
its gradient. In all the semidefinite programming applications detailed here, this can 
be done extremely efficiently, with a complexity of 0(n 3 ) and memory requirements 
in 0(n 2 ). The method is only efficient if all these steps can be performed explicitly 
or at least very efficiently. As we will see below, this is the case here. 

Prox-functions and related parameters. To Qx and Q2 we now associate norms 
and so-called prox-functions. For Qx, we use the Frobenius norm, and a prox-function: 

d^X) = -logdetX + log/3. 

The function d\ is strongly convex on Qx, with a convexity parameter of <j\ = 1//3 2 , 
in the sense that V 2 di (Y)[ff, ff] = ^{X^HX^H) > /3- 2 \\H\\ 2 F for every H. Fur- 
thermore, the center of the set, Xq := argmin^ggj d\{X) is Xq = f31 n , and satisfies 
di(X ) = 0. With our choice, we have D\ := max^ed d\{X) = nlog(/3/a). 

To Q2, we also associate the Frobenius norm, and the prox-function d2(U) = 
||C/|| 2 7 /2. With this choice, the center Uq of Q2 is Uo = 0. Furthermore, the function 
c?2 is strongly convex on its domain, with convexity parameter with respect to the 
1-norm o~\ = 1, and we have D2 := maxy e g 2 d%{U) = n 2 /2. 

The function / has a gradient that is Lipschitz-continuous with respect to the 
Frobenius norm on the set Qx, with Lipschitz constant M = 1/a 2 . Finally, the norm 
(induced by the Frobenius norm) of the operator A is ||A|| = p. 

Smooth minimization. The method is based on replacing the objective of the 
original problem, f(X), with f e (X), where e > is the desired accuracy, and f e is a 
penalized function involving the prox-function c?2, defined as: 

(3.2) f e (X) := f(X) + max (X, U) - {e/2D 2 )d 2 (U). 

The above function turns out to be a smooth uniform approximation to / everywhere, 
with maximal error e/2. Furthermore, the function f e is has a Lipschitz-continuous 
gradient, with Lipschitz constant given by L(e) := M + D2\\A\\ 2 / (2a 2 e). A specific 
first-order algorithm detailed in for smooth, constrained convex minimization is 
then applied to the function f e , with convergence rate in 0(y/L(e)/e). 

Nesterov's algorithm. Choose e > and set Xq — /3I n , the algorithm then updates 
primal and dual iterates Yfe and Uk using the following steps: 

1. Compute V/ e (X fe ) = -X- 1 + £ + U*(X k ), where U*(X) solves (|3~2l . 

2. Find Y k = argmin yeQl {(V/ e (X fe ), Y - X k ) + \L{e)\\Y - X k \\ 2 F } 

3. Find Z k - argmin ZeQi {M^i + ^ i±L(f e ( Xi ) + (V/ £ (X 4 ), Z - X,))} 

4. Update X k = ^Z k + £±Y k and U k = 



5. Repeat until the duality gap is less than the target precision: 

-logdetn + (E,F fc ) +pl T |r fe |l - <f>(U k ) < e. 

The key to the method's success is that step 1-3 and 5 can be performed explicitly 
and only involve an eigenvalue decomposition. Step one above computes the (smooth) 
function value and gradient. The second step computes the gradient mapping, which 
matches the gradient step for unconstrained problems (see ^3 p.86]). Step three and 
four update an estimate sequence [171 p. 72] of whose minimum can be computed 
explicitly and gives an increasingly tight upper bound on the minimum of / M . We 
now present these steps in detail for our problem. 

Step 1. The first step requires computing the gradient of the function 

f e (X) = f(X) + max (X, U) - (e/2D 2 )d 2 (U). 

u£Q 2 

This function can be expressed in closed form as f e (X) = f(X) + J2i j VvP^j')' where 

^ (x) . = { M - (e/4X>a) if \x\ > (e/2pD 2 ), 
^ ' ' y D2X 2 /e otherwise, 

which is simply the Moreau-Yosida regularization of the absolute value and the gra- 
dient of the function at X is 

vf^x) = -x- 1 + ^ + U*{X) 1 

with 

U*(X) := max(min(X//z, p), — p), 

with min. and max. understood componentwise. The cost of this step is dominated 
by that of computing the inverse of X, which is 0(n 3 ). 
Step 2. This step involves a problem of the form 

T Ql (X) = arg min (Vf e (X),Y - X) + \l\\Y - X\\%, 
y t y 1 ^ 

where X 6 Qi is given. This problem can be reduced to one of projection on Qi, 
namely 



1 .( ^ 7 

YeQ 



min \\Y-G\\ F , 



where G := X — L 1 Wf e (X). Using the rotational invariance of this problem, we 
reduce it to a vector problem: 

min A Si(Aj - n) 2 : a < Xi < f3, i=l,...,n, 

where 7 is the vector of eigenvalues of G. This problem admits a simple explicit 
solution: 

A, = min(max(7i, a), /?), i = l,...,n. 

The corresponding solution is then Y = V T d'mg(A)V, where G = V T d'mg(j)V is 
the eigenvalue decomposition of G. The cost of this step is dominated by the cost of 
forming the eigenvalue decomposition of G, which is 0(n 3 ). 
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Step 3. The third step involves solving a problem of the form 

(3.3) Z := arg max d x (X) + (S, X), 

xe Qi 

where S is given. Again, due to the rotational invariance of the objective and feasible 
set, we can reduce the problem to a one-dimensional problem: 

min A J2i a i^i ~ l°g A, : a < Xi < (3, 

where a contains the eigenvalues of S. This problem has a simple, explicit solution: 

Xi = min(max(l/tTi, a), (3), i = 1, . . . , n. 

The corresponding solution is then Y = V^ T diag(A)l^, where S — V T di&g(a)V is the 
eigenvalue decomposition of S. Again, the cost of this step is dominated by the cost 
of forming the eigenvalue decomposition of 5, which is 0(n 3 ). 

Computing 4>{Uk)- For a given matrix Uk the function 4> is computed as in l|3.1() : 

4>(Uk)= min -logdetA+ (S + L> fc ,A). 

This means projecting (S + Uk)^ 1 on Qx, i.e. only involves an eigenvalue decompo- 
sition. 

Complexity estimate. To summarize, for step 1, the gradient of f e is readily com- 
puted in closed form, via the computation of the inverse of X. Step 2 essentially 
amounts to projecting on Qi, and requires an eigenvalue problem to be solved; like- 
wise for step 3. In fact, each iteration costs 0(n 3 ). The number of iterations necessary 
to achieve an objective with absolute accuracy less than e is then given by: 

(3.4) N(e) := i\\A\\-J^-^ + J l - = V V S ' (Anap+^e), 

e V o-\U2 \ crie e 

where k = log(/3/a) bounds the solution's condition number. Thus, the overall com- 
plexity when p > is in 0(n 4 5 /e), as claimed. 

3.2. Block-coordinate gradient methods. In this section, we focus on the 
particular case where a = and (3 = +oo (hence implicitly a — ct(n), (3 — f3(n)) and 
derive gradient minimization algorithms that take advantage of the problem structure. 
We consider the following problem: 

(3.5) max logdetA - (S,X) -pl T |A|l 

in the variable X £ S n , where p > again controls the trade-off between log-likelihood 
and sparsity of the inverse covariance matrix. Its dual is given by: 

/„ „n minimize — logdet(E + U) — n 

subject to \Uij\ < p, i,j=l,...,n. 

in the variable U S S n . We partition the matrices X and U in block format: 

X=( Z T X ) and U=( \ U ), 
\ x y J \ u w J 
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where Z >- and U are fixed and x, u £ R'™ -1 *', i/, to 6 R are the variables (row 
and column) we are updating. We also partition the sample matrix according to the 
same block structure: 



E = 



A b 
b T c 



where A £ S( n _i), b £ R*-™ -1 ^, c £ R. In the methods that follow, we will update 
only one column (and corresponding row) at a time and without loss of generality we 
can always assume that we are updating the last one. 
Block- coordinate descent. The dual problem l|3.6fl : 

minimize — logdet(E + U) — n 
subject to \Uij \ < p, i,j = l,...,n. 

in the variable U £ S n , can be written in block format as: 

minimize - logdet(A + V) - log ((to + c) - (b + u) T (A + Vy 1 (b + u)) -n 
subject to |to| < p, \ui\ < p. i = 1, . . . ,n. 

in the variables u £ R*™ -1 ' and w £ R (V is fixed at each iteration). We directly 
get w — p so the diagonal of the optimal solution must be pi. The main step at each 
iteration is then a box constrained quadratic program (QP): 

, . minimize (b + u) T (A + V)^ 1 (b + u) 

subject to \ui\ < p, i — 1, . . . ,n, 

in the variable u £ R^" -1 -*. To summarize, the block coordinate descent algorithm 
proceeds as follows: 

1. Pick the row and column to update. 

2. Compute (A + V)- 1 . 

3. Solve the box constrained QP in l|3.7|l . 

4. Repeat until duality gap is less than precision: (E, X) — n + pl T \X\l < e. 

At each iteration, we need to compute the inverse of the submatrix (A + V) £ S(„_!), 
but we can update this inverse using the Sherman- Woodbury-Morrison formula on 
two rank-two updates, hence it is only necessary to compute a full inverse at the first 
iteration. 

Block- coordinate ascent. For a fixed Z, problem (|3.5(l is equivalent to: 

maximize log (y — x T Z~ x x) — 2b T x — y(c + p) — 2p\\x\\ i 
subject to y — x T Z~ 1 x > 0, y > 0, 

in the variables x £ R^™ -1 -*, y £ R, where Z y (given) and the Schur complement 
constraints imply X >~ 0. We can solve for the optimal y explicitly and the problem 
in x becomes: 

max— x T Qx — 2b T x — 2p||a;||i, 

X 

where Q := (c + p)Z~ 1 . Its dual is also box-constrained QP: 

minimize (b + u) T Z(b + u) 
subject to ||w||oo < P, 
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in the variable u £ R 1 -" 1 ' . At the optimum for this QP, we must have: 

x = —-. — - — -Z(b + u), and y = - — - — - + 7 — ^——r(b + u) T Z(b + u). 
(c + p) (c + p) (c + p) 2V 

and we iterate as above. 

Smooth optimization for box- constrained QPs. The two block-coordinate methods 
detailed in this section both amount to solving a sequence of box-constrained quadratic 
program of the form: 

, . minimize x 1 Ax + b T x 

' subject to ||a;||oo < P, 

in the variable x € R™. The objective function has a Lipschitz continuous gradient 
with constant L = 4cA max (^4) yfn on the box B — {x € R™ : ||:c||oo < p}, where we 
can define a prox function (l/2)||x|| 2 which is strongly convex with constant 1 and 
bounded above by (l/2)n j o 2 on B. From ]TQ or |T2], we know that solving (|3.8|l up to 
a precision e will require at most 2n a75 -y/2 / o 3 A max (A) / ^Je iterations of the first-order 
method detailed in j!6) . with each iteration equivalent to a matrix- vector product and 
a projection on the box B. This means that the total complexity of solving l|3.8|) is 
given by: 



Complexity estimate. Following f5j, with block coordinate descent corresponding 
to coordinate descent with the almost cyclic rule and using the fact that logdet(X) 
satisfies the strict convexity assumptions in A2] , we can show that the convergence 
rate of the block coordinate descent method is at least linear. Each iteration requires 
solving a box-constrained QP and takes 0(n 3 log(l/e)) operations using an interior 
point solver or 0(n 2 ' 75 /i/e) using the optimal first-order scheme in ^B]. We cannot 
use the same argument to show convergence of block coordinate ascent but empirical 
performance is comparable. In practice we have found that a small number of sweeps 
through all columns, independent of problem size n, is sufficient for convergence. 

Implementation. The block coordinate descent methods implemented here cor- 
respond to coordinate descent using the almost cyclic rule, alternative row/column 
selection rules could improve the convergence speed. Also, each iteration of the block 
coordinate descent method corresponds to two rank-two updates of the inverse ma- 
trix, hence the cost of maintaining the inverse submatrix using the Sherman Woodbury 
Morrison formula is only 0(n 2 ). 

4. Numerical results. In this section we test the performance of the methods 
detailed above on some randomly generated examples. We first form a sparse matrix 
A with a diagonal equal to one and a few randomly chosen, nonzero off-diagonal terms 
equal to +1 or -1. We then form the matrix: 

B = A^ 1 + aV 

where V € S n is a symmetric, i.i.d uniform random matrix. Finally, we make B 
positive definite by shifting its eigenvalues, and use this noisy, random matrix to test 
our covariance selection methods. 
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Fig. 4.1. Recovering the sparsity pattern. We plot the original inverse covariance matrix A, 
the solution to problem 12. IV and the noisy inverse B~ 1 . 



In Figure 14.11 we plot the sparsity patterns of the original inverse covariance 
matrix A, the solution to problem H2.1(l and the noisy inverse B" 1 in a randomly 
generated example with n = 30, a = 0.15 and p = 0.5. In Figure l4~2l we represent the 
dependence structure of interest rates (sampled over a year) inferred from the inverse 
covariance matrix. Each node represents a particular interest rate maturity and the 
nodes are linked if the corresponding coefficient in the inverse covariance matrix is 
nonzero (i.e. they are conditionally dependent). We compare the solution to problem 
H2.1JI on this matrix for p — and p = 0.1 and notice that in the sparse solution the 
rates appear clearly clustered by maturity. 




Fig. 4.2. We plot the network formed using the solution to problem 12. IV on an interest rate 
covariance matrix for p = (left) and p = 0.1 (right). In the sparse solution the rates appear clearly 
clustered by maturity. 

In Figure 14.31 we study computing times for various choices of algorithms and 
problem sizes. On the left, we plot CPU time to reduce the duality gap by a factor 10~ 2 
versus problem size n, on randomly generated problems, using the coordinate descent 
code and the optimal first-order for solving box QPs. On the right, we plot duality 
gap versus CPU time for both smooth minimization and block-coordinate algorithms 
for a randomly generated problem of size n = 250. For the smooth minimization 
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code, we set a — 1/A max (i3) and we plot computing time for both (3 — l/(2A min (i?)) 
(Smooth. Opt. 1/2) and f3 = 2/X min (B) (Smooth. Opt. 2). 
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Fig. 4.3. Computing time. Left; We plot CPU time to reduce the duality gap by a factor 10 -2 
versus problem size n, on randomly generated problems, using the coordinate descent code and the 
optimal first-order algorithm for solving box QPs. Right; We plot duality gap versus CPU time for 
both smooth minimization and block- coordinate algorithms, for a problem of size n = 250. 
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